N-gram-based Text Attribution
Abstract
Quantitative authorship attribution refers to the task of identifying the author of a text based on measurable features of the author’s style, a problem that has practical applications in areas as diverse as literary scholarship, plagiarism detection, and criminal forensics. Attribution methods generally follow a generative approach, wherein a statistical “profile” is created for each of a set of candidate authors, based on certain features of the authors’ writings, and the author whose profile most closely resembles the corresponding features of the unclassified text is selected. Potential features include word and sentence lengths, letter frequencies, word frequencies, vocabulary richness, word collocations, and more sophisticated (but not necessarily more useful) patterns that appear after syntactic tagging.
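The profile-and-compare scheme described in the abstract can be sketched in a few lines. The following is a minimal illustration only, not the method of the paper: it assumes character trigrams as the feature set and cosine similarity as the closeness measure, and the function names (`char_ngrams`, `profile`, `cosine_similarity`, `attribute`) are hypothetical.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def profile(text, n=3):
    """Build a profile: relative frequency of each character n-gram."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def attribute(unknown_text, candidates, n=3):
    """Pick the candidate whose profile is closest to the unknown text.

    `candidates` maps an author name to a sample of that author's writing.
    The similarity measure is an assumption made for this sketch.
    """
    unknown = profile(unknown_text, n)
    profiles = {author: profile(sample, n) for author, sample in candidates.items()}
    return max(profiles, key=lambda author: cosine_similarity(unknown, profiles[author]))
```

Any of the other features listed above (word frequencies, sentence lengths, and so on) could be substituted for the n-gram counts in `profile` without changing the selection step in `attribute`.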
Similar resources
On the Robustness of Authorship Attribution Based on Character N-gram Features
A number of independent authorship attribution studies have demonstrated the effectiveness of character n-gram features for representing the stylistic properties of text. However, the vast majority of these studies examined the simple case where the training and test corpora are similar in terms of genre, topic, and distribution of the texts. Hence, there are doubts whether such a simple and lo...
N-Gram Based Authorship Attribution in Urdu Poetry
Authorship attribution is an interesting problem in Computational Linguistics. Traditional author recognition systems for electronic text rely on techniques which train the system to the specific vocabulary and writing style of the writer and apply stochastic methods to judge a given text at byte, letter or word levels. In this paper we have developed a software system to apply one existing and...
Authorship Attribution in Bengali Language
We describe Authorship Attribution of Bengali literary text. Our contributions include a new corpus of 3,000 passages written by three Bengali authors, an end-to-end system for authorship classification based on character n-grams, feature selection for authorship attribution, feature ranking and analysis, and learning curve to assess the relationship between amount of training data and test accu...
Domain Specific Author Attribution based on Feedforward Neural Network Language Models
Authorship attribution refers to the task of automatically determining the author based on a given sample of text. It is a problem with a long history and a wide range of applications. Building author profiles using language models is one of the most successful methods to automate this task. New language modeling methods based on neural networks alleviate the curse of dimensionality and usua...
Author Identification Using Different Sizes of Documents: A Summary
In the present research work, we deal with the problem of authorship attribution of ancient Arabic text documents, which were written by several ancient philosophers. For that purpose, we conducted several authorship attribution experiments applied with different text sizes. A special dataset, called “A4P” (Authorship Attribution for Ancient Arabic Philosophers), has been constructed by extract...
Publication date: 2009